Abstract

Learning machines that have hierarchical structures or hidden variables are singular statistical
models because they are nonidentifiable and their Fisher information matrices are singular. In singular
statistical models, neither does the Bayes a posteriori distribution converge to the normal distribution
nor does the maximum likelihood estimator satisfy asymptotic normality. This is the main reason
that it has been difficult to predict their generalization performance from trained states. In this
paper, we study four errors, (1) the Bayes generalization error, (2) the Bayes training error, (3)
the Gibbs generalization error, and (4) the Gibbs training error, and prove that there are universal
mathematical relations among these errors. The formulas proved in this paper are equations of states
in statistical estimation because they hold for any true distribution, any parametric model, and any
a priori distribution. Also we show that the Bayes and Gibbs generalization errors can be estimated
by Bayes and Gibbs training errors, and we propose widely applicable information criteria that can
be applied to both regular and singular statistical models.

1 Introduction

Recently, many learning machines are being used in information processing systems. For
example, layered neural networks, normal mixtures, binomial mixtures, Bayes networks,
Boltzmann machines, reduced rank regressions, hidden Markov models, and stochastic
context-free grammars are being employed in pattern recognition, time series prediction,
robotic control, human modeling, and biostatistics. Although their generalization
performances determine the accuracy of the information systems, it has been difficult to
estimate generalization errors based on training errors, because such learning machines
are singular statistical models.

A parametric model is called regular if the mapping from the parameter to the prob- 
ability distribution is one-to-one and if its Fisher information matrix is always positive 
definite. If a statistical model is regular, then the Bayes a posteriori distribution converges 
to the normal distribution, and the maximum likelihood estimator satisfies asymptotic 
normality. Based on such properties, the relation between the generalization error and 
the training error was clarified, on which some information criteria were proposed. 

On the other hand, if the mapping from the parameter to the probability distribution
is not one-to-one or if the Fisher information matrix is singular, then the parametric
model is called singular. In general, if a learning machine has hierarchical structure or
hidden variables, then it is singular. Therefore, almost all learning machines are singular.
For singular learning machines, the log likelihood function cannot be approximated by
any quadratic form of the parameter, with the result that the conventional relationship
between generalization errors and training errors holds neither for the maximum
likelihood method [8] nor for Bayes estimation [2]. Singularities strongly affect
generalization performances [15] and learning dynamics [1]. Therefore, in order to establish
the mathematical foundation of singular learning theory, it is necessary to construct
formulas which hold even in singular learning machines.

Recently, we proved [13] [15] that the generalization error in Bayes estimation is
asymptotically equal to $\lambda/n$, where $\lambda > 0$ is the rational number determined by the zeta
function of a learning machine and $n$ is the number of training samples. In regular statistical
models, $\lambda = d/2$, where $d$ is the dimension of the parameter space, whereas in singular
statistical models, $\lambda$ depends strongly on the learning machine, the true distribution, and
the a priori probability distribution. In practical applications, the true distribution is
often unknown, hence it has been difficult to estimate the generalization error from the
training error. To estimate the generalization error when we do not have any information
about the true distribution, we need a general formula which holds independently of
singularities.

In this paper, we study four errors, (1) the Bayes generalization error $B_g$, (2) the Bayes
training error $B_t$, (3) the Gibbs generalization error $G_g$, and (4) the Gibbs training error
$G_t$, and prove the formulas

$$E[B_g] - E[B_t] = 2\beta\,(E[G_t] - E[B_t]) + o(1/n),$$

$$E[G_g] - E[G_t] = 2\beta\,(E[G_t] - E[B_t]) + o(1/n),$$

where $E[\cdot]$ denotes the expectation value and $0 < \beta < \infty$ is the inverse temperature of the
a posteriori distribution. These equations assert that the increase in error from training
to generalization is in proportion to the difference between the Bayes and Gibbs training
errors. It should be emphasized that these formulas hold for any true distribution, any
learning machine, any a priori probability distribution, and any singularities; therefore
they reflect universal laws of statistical estimation. Also, based on the formulas, we
propose widely applicable information criteria (WAIC) which can be applied to both
regular and singular learning machines. In other words, we can apply WAIC without any
knowledge about the true distribution.
This paper consists of six parts. In Section 2, we describe the main results of this
paper. In Section 3, we propose widely applicable information criteria and show how
to apply them to statistical estimation. In Section 4, we prove the main results in a
mathematically rigorous way. In Sections 5 and 6, we discuss the results and conclude the paper.
The proofs of the lemmas are quite technical, hence they are presented in the Appendix.

2 Main Results 

Let $(\Omega, \mathcal{B}, P)$ be a probability space, and $X : \Omega \to \mathbb{R}^N$ be a random variable whose
probability distribution is $q(x)dx$. Here $\mathbb{R}^N$ denotes the $N$-dimensional Euclidean space.
We assume that the random variables $X_1, X_2, \ldots, X_n$ are independently subject to the same
probability distribution as $X$. In learning theory, $q(x)dx$ is called the true distribution
and $D_n = \{X_1, X_2, \ldots, X_n\}$ is a set of training samples. A learning machine is defined
by a parametric probability density function $p(x|w)$ of $x \in \mathbb{R}^N$ for a given parameter
$w \in W \subset \mathbb{R}^d$, where $W$ is a set of parameters. An a priori probability density function
$\varphi(w)$ is defined on $W$. The Bayes a posteriori probability density $p(w|D_n)$ for a given set
of training samples $D_n$ is defined by

$$p(w|D_n) = \frac{1}{C_n}\,\varphi(w)\prod_{i=1}^{n} p(X_i|w)^{\beta},$$

where $\beta > 0$ is the inverse temperature and $C_n > 0$ is the normalizing constant. The
expectation value with respect to this probability distribution is denoted by $E_w[\cdot]$. Also
$E_{D_n}[\cdot]$ and $E_X[\cdot]$ denote respectively the expectation values over $D_n$ and $X$. We sometimes
omit $D_n$ and simply use $E[\cdot]$. We study the four errors defined below.
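(1) Bayes generalization error,

$$B_g = E_X\left[\log\frac{q(X)}{E_w[p(X|w)]}\right].$$

(2) Bayes training error,

$$B_t = \frac{1}{n}\sum_{j=1}^{n}\log\frac{q(X_j)}{E_w[p(X_j|w)]}.$$

(3) Gibbs generalization error,

$$G_g = E_w\left[E_X\left[\log\frac{q(X)}{p(X|w)}\right]\right].$$

(4) Gibbs training error,

$$G_t = E_w\left[\frac{1}{n}\sum_{j=1}^{n}\log\frac{q(X_j)}{p(X_j|w)}\right].$$

These four errors are measurable functions of $D_n$, hence they are also random variables.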
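
In a simulation where the true density is known, the two training errors can be approximated
from a finite set of posterior draws. The following is a minimal sketch under that assumption:
the function names, the nested-list layout `f[s][j]`, and the example numbers are hypothetical
and are not taken from this paper. The posterior expectation $E_w$ is replaced by an average
over draws.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def training_errors(f):
    """Approximate (B_t, G_t) from log density ratios.

    f[s][j] = log(q(X_j) / p(X_j | w_s)) for posterior draws w_s (assumed layout).
    """
    num_draws, n = len(f), len(f[0])
    # B_t = (1/n) sum_j -log E_w[exp(-f(X_j, w))]
    b_t = sum(-log_mean_exp([-f[s][j] for s in range(num_draws)])
              for j in range(n)) / n
    # G_t = E_w[(1/n) sum_j f(X_j, w)]
    g_t = sum(sum(row) / n for row in f) / num_draws
    return b_t, g_t

# Hypothetical values: 3 posterior draws, 4 training samples.
f_example = [[0.10, -0.20, 0.30, 0.05],
             [0.00, 0.10, 0.20, -0.10],
             [0.25, -0.05, 0.10, 0.15]]
b_t, g_t = training_errors(f_example)
```

By Jensen's inequality the Bayes training error computed this way never exceeds the Gibbs
training error, which gives a quick sanity check on any implementation.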
Remark. The Bayes generalization error is equal to the Kullback-Leibler distance from
the true distribution $q(x)$ to the Bayes predictive distribution $E_w[p(x|w)]$. The Gibbs
generalization error is equal to the average of the Kullback-Leibler distance from the
true distribution to the Gibbs estimation. Since they show the accuracy of Bayes and Gibbs
estimations, it is important for statistical learning machines to be able to estimate them
from random samples.

We need some mathematical assumptions which ensure that the theorems hold. Let
us define a log density ratio function by

$$f(x,w) = \log\frac{q(x)}{p(x|w)}.$$

In this paper, we mainly study the singular case, that is to say, the situation when the
set of true parameters $\{w \in W;\ q(x) = p(x|w)\}$ consists of more than one point and
the Fisher information matrix is not positive definite. We assume the following three
conditions.

(A.1) Assume that the set of parameters $W$ is a compact set which is the closure of an
open set in $\mathbb{R}^d$. The set $W$ is defined by

$$W = \{w \in \mathbb{R}^d;\ \pi_1(w) \ge 0, \ldots, \pi_k(w) \ge 0\},$$

where $\pi_1(w), \ldots, \pi_k(w)$ are analytic functions, and the a priori probability density $\varphi(w)$
is given by $\varphi(w) = \varphi_0(w)\varphi_1(w)$ where $\varphi_0(w) > 0$ is a $C^{\infty}$-class function and $\varphi_1(w) \ge 0$ is
an analytic function.

(A.2) Let $s \ge 6$ be a constant, and $L^s(q)$ be the complex Banach space defined by

$$L^s(q) = \Big\{f(x);\ \int |f(x)|^s q(x)\,dx < \infty\Big\}.$$

Assume that there exists an open set $W' \subset \mathbb{C}^d$ which contains $W$ such that the function
$W' \ni w \mapsto f(\cdot, w)$ is an $L^s(q)$-valued analytic function.

(A.3) Let $W_0 = \{w \in W;\ q(x) = p(x|w)\}$ be the set of true parameters. The set $W_0$ is
not the empty set and there exists an open set $W^*$ which contains $W$ such that, for
$M(x) = \sup_{w \in W^*} |f(x,w)|$,

$$E_X\Big[\sup_{w \in W^*} |f(X,w)|^s\Big] < \infty,$$

and there exists $t > 0$ such that, for $Q(x) = \sup_{K(w) \le t} p(x|w)$,

$$\int M(x)^2\, Q(x)\,dx < \infty.$$

Remark. These assumptions are needed for mathematical reasons.

(1) These conditions allow for the case that the set of true parameters $W_0 = \{w \in
W;\ q(x) = p(x|w)\}$ is not a single point but an algebraic set or an analytic set with
singularities. In general, the Fisher information matrix has zero eigenvalues. On the
other hand, in conventional statistical learning theory, it is assumed that $W_0$ consists of
one point and the Fisher information matrix is positive definite. Under the assumptions of
this paper, we cannot use any result of conventional statistical learning theory.

(2) The condition that $W$ is compact is necessary because, even if the log density ratio
function is an analytic function of the parameter, $|w| = \infty$ is a singularity in general.
For this reason, if $W$ is not compact and $W_0$ contains $|w| = \infty$, the maximum likelihood
estimator does not exist in general. In fact, if $x = (x_1, x_2)$, $w = (a, b)$, and $f(x,w) =
(x_2 - a\sin(b x_1))^2/2$, and $W_0$ contains $\{a = 0\}$, then the maximum likelihood estimator
never exists. On the other hand, if $|w| = \infty$ is not a singularity, $\mathbb{R}^d \cup \{|w| = \infty\}$ can be
understood as a compact set and the same theorems established in this paper hold.

(3) The condition that $\pi_1(w), \ldots, \pi_k(w)$ and $\varphi_1(w)$ are analytic functions is necessary
because if one of them is a $C^{\infty}$ class function, there exists a pathological example. In fact,
if $\varphi_1(w) = \exp(-1/\|w\|^2)$ in a neighborhood of the origin and the set of true parameters
is the origin, then the four errors may not be in proportion to $1/n$.

(4) The condition $s \ge 6$ is needed to ensure the existence of the asymptotic expansion of
the Bayes generalization error in our proof. (See the proof of Theorem 1.)

(5) Some non-analytic statistical models can be made analytic. For example, in a simple
mixture model $p(x|a) = a\,p_1(x) + (1-a)\,p_2(x)$ for some probability densities $p_1(x)$ and $p_2(x)$,
the log density ratio function $f(x,a)$ is not analytic at $a = 0$, but it can be made analytic
by the representation $p(x|\theta) = \alpha^2 p_1(x) + \beta^2 p_2(x)$ on the manifold $\theta \in \{\alpha^2 + \beta^2 = 1\}$. As
is shown in the proofs, if $W$ is contained in an analytic manifold, then the same theorems
hold as stated in this paper.

(6) Note that 




(1) 



Based on assumptions (A.1), (A.2), and (A.3), we prove the following results.

Theorem 1 (1) There exist random variables $B_g^*$, $B_t^*$, $G_g^*$, and $G_t^*$ such that, as $n \to \infty$,
the following convergences in law hold.

$$nB_g \to B_g^*, \quad nB_t \to B_t^*, \quad nG_g \to G_g^*, \quad nG_t \to G_t^*.$$

(2) As $n \to \infty$, the following convergence in probability holds,

$$n(B_g - B_t - G_g + G_t) \to 0.$$

(3) The expectation values of the four errors converge as follows,

$$E[nB_g] \to E[B_g^*], \quad E[nB_t] \to E[B_t^*],$$
$$E[nG_g] \to E[G_g^*], \quad E[nG_t] \to E[G_t^*].$$

For the proof of this theorem, see Section 4. The following theorem is the main result of
this paper.

Theorem 2 (Equations of States in Statistical Estimation). The following equations
hold.

$$E[B_g^*] - E[B_t^*] = 2\beta\,(E[G_t^*] - E[B_t^*]), \tag{2}$$

$$E[G_g^*] - E[G_t^*] = 2\beta\,(E[G_t^*] - E[B_t^*]). \tag{3}$$

Remark. (1) Theorem 2 asserts that the increases of errors from training to prediction
are in proportion to the difference between the Bayes and Gibbs training errors. We
refer to Theorem 2 as Equations of States in Statistical Estimation, because they
hold for any true distribution, any learning machine, any a priori distribution, and any
singularities. It is proved that the equations of states hold even if the true distribution is
not contained in the parametric model [22].

(2) Although the equations of states hold universally, the four errors themselves depend
strongly on the true distribution, the learning machine, the a priori distribution, and the
singularities.

(3) Theorem 2 also asserts a conservation law, namely, the difference between the Bayes
error and the Gibbs error is invariant between training and generalization,

$$E[G_g^*] - E[B_g^*] = E[G_t^*] - E[B_t^*]. \tag{4}$$

As is shown in Theorem 1, this conservation law holds not only for expectations, but also
for the random variables, as the number of training samples tends to infinity.



Corollary 1 The two generalization errors can be estimated by the two training errors,

$$\begin{pmatrix} E[B_g^*] \\ E[G_g^*] \end{pmatrix}
= \begin{pmatrix} 1-2\beta & 2\beta \\ -2\beta & 1+2\beta \end{pmatrix}
\begin{pmatrix} E[B_t^*] \\ E[G_t^*] \end{pmatrix}. \tag{5}$$

Remark. (1) From eq.(5), it follows that

$$\begin{pmatrix} E[G_t^*] \\ E[B_t^*] \end{pmatrix}
= \begin{pmatrix} 1-2\beta & 2\beta \\ -2\beta & 1+2\beta \end{pmatrix}
\begin{pmatrix} E[G_g^*] \\ E[B_g^*] \end{pmatrix},$$

which shows that there is a symmetry between generalization errors and training errors.
(2) Since the set of eigenvalues of the linear transform in eq.(5) is $\{1\}$, and the dimension
of the linear invariant subspace is one, there is no conservation law other than eq.(4).
(3) A statistical model is called regular if the set of true parameters $W_0 = \{w \in W;\ q(x) =
p(x|w)\}$ consists of a single point and if the Fisher information matrix is always positive
definite. Note that a regular model is a very special example of singular learning machines.
For a regular statistical model, we have

$$E[B_g^*] = \frac{d}{2}, \qquad E[G_g^*] = \Big(1 + \frac{1}{\beta}\Big)\frac{d}{2},$$

which is a special case of Theorem 2.
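
The linear relation of eq.(5) and its symmetric counterpart in Remark (1) can be checked
numerically. The sketch below is illustrative only: the function names are hypothetical, and
the regular-model training values it produces are obtained here by applying the symmetric
relation to the two generalization values quoted above, not read from the paper.

```python
def generalization_from_training(b_t, g_t, beta):
    """Eq. (5): map (E[B_t*], E[G_t*]) to (E[B_g*], E[G_g*])."""
    b_g = (1 - 2 * beta) * b_t + 2 * beta * g_t
    g_g = -2 * beta * b_t + (1 + 2 * beta) * g_t
    return b_g, g_g

def training_from_generalization(b_g, g_g, beta):
    """Symmetric relation of Remark (1): map (E[B_g*], E[G_g*]) to (E[B_t*], E[G_t*])."""
    g_t = (1 - 2 * beta) * g_g + 2 * beta * b_g
    b_t = -2 * beta * g_g + (1 + 2 * beta) * b_g
    return b_t, g_t

def trace_and_determinant(beta):
    """Trace and determinant of the 2x2 matrix in eq. (5)."""
    trace = (1 - 2 * beta) + (1 + 2 * beta)
    det = (1 - 2 * beta) * (1 + 2 * beta) + (2 * beta) ** 2
    return trace, det

# Regular model with d = 4 parameters at beta = 1 (illustrative numbers).
d, beta = 4, 1.0
b_g, g_g = d / 2, (1 + 1 / beta) * d / 2
b_t, g_t = training_from_generalization(b_g, g_g, beta)
round_trip = generalization_from_training(b_t, g_t, beta)
```

A trace of 2 together with a determinant of 1 means that both eigenvalues of the matrix
equal 1, in line with Remark (2).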

Theorem 2 reveals the universal relations among the four errors. It holds even if the
set of true parameters has complex singularities. However, its statement simultaneously
shows that we can extract no information about singularities directly from Theorem 2.
Theorem 3 shows that the four errors contain important information about singularities.
The Kullback-Leibler distance is

$$K(w) = E_X[f(X,w)] = \int q(x)\log\frac{q(x)}{p(x|w)}\,dx.$$

The zeta function of a learning machine is defined by

$$\zeta(z) = \int_W K(w)^z\,\varphi(w)\,dw. \tag{6}$$

The zeta function is a holomorphic function of a complex variable $z$ in the region $\mathrm{Re}(z) >
0$, which can be analytically continued to a meromorphic function on the entire complex
plane. Its poles are all real, negative, and rational numbers (for the proof, see [9] [17]).
They are denoted as follows,

$$0 > -\lambda_1 > -\lambda_2 > -\lambda_3 > \cdots.$$

The order of each pole $(-\lambda_k)$ is denoted by $m_k$. We simply use the notations $\lambda = \lambda_1$ and $m = m_1$
for the largest pole and its order respectively.

Theorem 3 As $n \to \infty$, the convergence in probability

$$nG_g + nG_t - \frac{2\lambda}{\beta} \to 0$$

holds. Therefore

$$E[G_g^*] + E[G_t^*] = \frac{2\lambda}{\beta}. \tag{7}$$

Also the following corollary holds.

Corollary 2 The following convergence in probability holds,

$$nB_g - nB_t + 2nG_t - \frac{2\lambda}{\beta} \to 0.$$

In particular, if $\beta = 1$, $E[B_g^*] = \lambda$.



From these theorems and corollaries, if one knows the true distribution, one can predict
the Bayes and Gibbs generalization errors from the Bayes and Gibbs training errors with
probability one, as $n$ tends to infinity. In practical applications, we seldom know the
true distribution; however, this fact is useful in computer simulation research of learning
theory and statistics. Lastly, by Theorems 2 and 3, the following corollary is immediately
proved.

Corollary 3 Let $\nu = \nu(\beta) = \beta\,(E[G_t^*] - E[B_t^*])$. Then

$$E[B_g^*] = \frac{\lambda - \nu}{\beta} + \nu, \qquad E[B_t^*] = \frac{\lambda - \nu}{\beta} - \nu,$$

$$E[G_g^*] = \frac{\lambda}{\beta} + \nu, \qquad E[G_t^*] = \frac{\lambda}{\beta} - \nu.$$

Therefore Bayes learning is asymptotically determined by $\lambda$ and $\nu$.
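
Solving eqs.(2), (3) and (7) for the four expectations in terms of $\lambda$ and $\nu$ gives closed
forms that are easy to check numerically. The sketch below uses those solved forms as an
assumption of the example; the function name and the numbers are hypothetical.

```python
def asymptotic_errors(lam, nu, beta):
    """Four asymptotic expectations written in terms of lam, nu, and beta.

    The expressions are obtained by solving eqs. (2), (3), and (7) with
    nu = beta * (E[G_t*] - E[B_t*]); they are assumptions of this sketch.
    """
    return {
        "Bg": (lam - nu) / beta + nu,
        "Bt": (lam - nu) / beta - nu,
        "Gg": lam / beta + nu,
        "Gt": lam / beta - nu,
    }

# Illustrative values only.
lam, nu, beta = 0.5, 0.8, 2.0
e = asymptotic_errors(lam, nu, beta)

# Residuals of eq. (2), eq. (3), and eq. (7); each should vanish.
residual_eq2 = (e["Bg"] - e["Bt"]) - 2 * beta * (e["Gt"] - e["Bt"])
residual_eq3 = (e["Gg"] - e["Gt"]) - 2 * beta * (e["Gt"] - e["Bt"])
residual_eq7 = (e["Gg"] + e["Gt"]) - 2 * lam / beta
```

Setting `beta = 1` makes the Bayes generalization value collapse to `lam`, matching the last
statement of Corollary 2.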

In general, $\nu$ depends on $\beta > 0$. In regular statistical models, $\lambda = \nu = d/2$ for arbitrary
$\beta > 0$, whereas in singular learning machines, they are different in general. Corollary
2 was first discovered in [13]. Since the constant $\lambda$ depends strongly on the true
distribution, the learning machine, and the a priori distribution, it characterizes the
properties of learning machines. The values of $\lambda$ for several models have been studied in neural
networks [12], normal mixtures [21], reduced rank regressions [2], Boltzmann machines
[25], and hidden Markov models [26]. Also the behavior of $\lambda$ was analyzed for the case
when Jeffreys' prior is employed as an a priori distribution [13], and in the case when the
distance of the true distribution from the singularity is in proportion to $1/\sqrt{n}$ [18].

3 Widely Applicable Information Criteria 

The main purpose of this paper is to prove the theorems above. However, in order 
to illustrate the importance of the results of this paper, we propose widely applicable 
information criteria and introduce an experiment. Experimental analysis of practical 
applications is a topic for future study. 
3.1 Basic Concepts

Based on Corollary 1, we establish new information criteria which can be used for both
regular and singular learning machines. Let us define the Bayes generalization loss, the
Bayes training loss, the Gibbs generalization loss, and the Gibbs training loss by

$$BL_g = -E_X[\log E_w[p(X|w)]],$$
$$BL_t = -\frac{1}{n}\sum_{j=1}^{n}\log E_w[p(X_j|w)],$$
$$GL_g = -E_w[E_X[\log p(X|w)]],$$
$$GL_t = -E_w\Big[\frac{1}{n}\sum_{j=1}^{n}\log p(X_j|w)\Big].$$

These losses are random variables. Both training losses $BL_t$ and $GL_t$ can be numerically
calculated based on training samples $D_n$ and a learning machine $p(x|w)$ without
any knowledge of the true density function $q(x)$. By combining the entropy of the true
distribution with Corollary 1, using

$$\int q(x)\log q(x)\,dx = E\Big[\frac{1}{n}\sum_{i=1}^{n}\log q(X_i)\Big],$$

we obtain the equations,

$$E[BL_g] = E[BL_t] + 2\beta\,(E[GL_t] - E[BL_t]) + o(1/n),$$
$$E[GL_g] = E[GL_t] + 2\beta\,(E[GL_t] - E[BL_t]) + o(1/n).$$

Let us define widely applicable information criteria (WAIC) by

$$\mathrm{WAIC}_1 = BL_t + 2\beta\,(GL_t - BL_t),$$
$$\mathrm{WAIC}_2 = GL_t + 2\beta\,(GL_t - BL_t).$$

Then the expectations of the two criteria respectively equal the Bayes and Gibbs
generalization losses,

$$E[BL_g] = E[\mathrm{WAIC}_1] + o(1/n),$$
$$E[GL_g] = E[\mathrm{WAIC}_2] + o(1/n).$$

Therefore, $\mathrm{WAIC}_1$ and $\mathrm{WAIC}_2$ provide indices for model evaluation.
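
Since both training losses need only the training samples and the model, the two criteria can
be computed from a table of pointwise log-likelihoods evaluated at posterior draws. The
following is a minimal sketch under that assumption: the function names and the nested-list
layout `loglik[s][j]` are hypothetical, and the posterior expectation is approximated by an
average over draws taken at the same inverse temperature `beta`.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def waic(loglik, beta=1.0):
    """Return (WAIC_1, WAIC_2).

    loglik[s][j] = log p(X_j | w_s), where the w_s are draws from the
    a posteriori distribution at inverse temperature beta (assumed layout).
    """
    num_draws, n = len(loglik), len(loglik[0])
    # Bayes training loss: -(1/n) sum_j log E_w[p(X_j | w)]
    bl_t = -sum(log_mean_exp([loglik[s][j] for s in range(num_draws)])
                for j in range(n)) / n
    # Gibbs training loss: -E_w[(1/n) sum_j log p(X_j | w)]
    gl_t = -sum(sum(row) for row in loglik) / (num_draws * n)
    correction = 2.0 * beta * (gl_t - bl_t)
    return bl_t + correction, gl_t + correction

# Hypothetical log-likelihood values: 3 draws, 4 samples.
loglik_example = [[-1.2, -0.8, -1.5, -1.0],
                  [-1.0, -0.9, -1.7, -1.1],
                  [-1.3, -0.7, -1.4, -0.9]]
waic1, waic2 = waic(loglik_example, beta=1.0)
```

When comparing candidate models on the same data, the model with the smaller value of the
chosen criterion would be preferred.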

Remark. If a model is regular and the true distribution is contained in the parametric
model, then $\lambda = d/2$ and

$$2\beta\,(E[G_t^*] - E[B_t^*]) = d \tag{8}$$

hold. It is proved in [22] that, even if a model $p(x|w)$ does not contain the true distribution
$q(x)$, the equations of states hold if the Hessian matrix of the Kullback-Leibler distance is
positive definite at the unique optimal parameter $w^*$ that minimizes the Kullback-Leibler
distance from $q(x)$ to $p(x|w)$. In such a case,

$$2\beta\,(E[G_t^*] - E[B_t^*]) = \mathrm{tr}(IJ^{-1}), \tag{9}$$

where $I$ and $J$ are $d \times d$ matrices defined by

$$I_{ij} = \int \partial_i f(x,w^*)\,\partial_j f(x,w^*)\,q(x)\,dx,$$
$$J_{ij} = \int \partial_i\partial_j f(x,w^*)\,q(x)\,dx.$$

Here we used the notation $\partial_i = (\partial/\partial w_i)$. Moreover, as $n \to \infty$, the convergence in probability

$$2\beta\,(G_t^* - B_t^*) \to \mathrm{tr}(IJ^{-1}) \tag{10}$$

holds. If $\beta \to \infty$, both the Bayes and Gibbs estimations result in the maximum likelihood
method. Therefore, for regular statistical models, WAIC has asymptotically the same
variance as AIC. In other words, WAIC can be understood as information criteria
generalized from AIC. For singular learning machines, neither eq.(8) nor eq.(9) holds, for
example, $J^{-1}$ does not exist, whereas WAIC gives the accurate generalization error.

Remark. In Bayes estimation, the marginal likelihood or the stochastic complexity

$$F = -\log\int \varphi(w)\prod_{i=1}^{n} p(X_i|w)\,dw$$

is often used in model selection and hyperparameter optimization. We clarified its behavior
for singular learning machines in [15]. In regular statistical models, $F$ is asymptotically
equal to BIC; however, in singular models, it is not equal to BIC even asymptotically.
Note that $F$ does not correspond to the generalization error, hence the model that
minimizes $F$ does not minimize the generalization error in general. The Bayes
and Gibbs generalization errors are important because they correspond directly to the
Kullback-Leibler distance from the true distribution to the estimated one. In this paper,
we construct mathematically new information criteria which correspond to the generalization
error. Even for regular statistical models, there is much research and discussion which
compares AIC with BIC. It is a topic for future study to compare the marginal likelihood
and the equations of states from the viewpoint of statistical methodology.

Remark. In conventional Bayes estimation, the inverse temperature $\beta = 1$ is used. Hence
WAIC for $\beta = 1$ is most important. On the other hand, WAIC for general $\beta$ shows the
effect of the inverse temperature on the generalization and training errors. Moreover, in
applications, one may use $\beta$ as a hyperparameter. In such a case, it can be optimized by
the minimization of WAIC.

 H   Theory E[B_g]   E[B_g]        sigma[B_g]    E[WAIC_1]     sigma[WAIC_1]
 1   -               [illegible]   [illegible]   [illegible]   [illegible]
 2   -               3.013187      0.118109      2.993593      0.225722
 3   0.027000        0.028422      0.007393      0.025139      0.006886
 4   0.030000        0.030830      0.007678      0.027207      0.008176
 5   0.032000        0.033030      0.008418      0.030152      0.008728
 6   0.034000        0.034978      0.008832      0.031382      0.009778

Table 1: Experimental Results

3.2 Experiments 

We studied reduced rank regressions. The input and output vector is $x = (x_1, x_2) \in
\mathbb{R}^{N_1} \times \mathbb{R}^{N_2}$ and the parameter is $w = (A, B)$ where $A$ and $B$ are respectively $N_1 \times H$ and
$H \times N_2$ matrices. The learning machine is

$$p(x|w) = q(x_1)\,\frac{1}{(2\pi\sigma^2)^{N_2/2}}\exp\Big(-\frac{1}{2\sigma^2}\|x_2 - BAx_1\|^2\Big).$$

Since $q(x_1)$ has no parameter, it is not estimated. The true distribution is determined
by matrices $A_0$ and $B_0$ such that $\mathrm{rank}(B_0A_0) = H_0$. The algebraic variety of the true
parameters is defined by $K(A,B) = 0$, where

$$K(A,B) \propto \|BA - B_0A_0\|^2$$

has complicated singularities. We conducted experiments for the case that $N_1 = N_2 = 6$,
$H_0 = 3$, $\beta = 1$, $n = 500$, and $\sigma = 0.1$. The a priori distribution was $\varphi(A,B) \propto \exp(-2.0 \cdot
10^{[\text{exponent illegible}]}(\|A\|^2 + \|B\|^2))$. Reduced rank regressions with hidden units $H = 1, 2, \ldots, 6$ were
employed. The a posteriori distribution was numerically approximated by the Metropolis
method, where the initial 5000 steps were omitted and 2000 parameters were collected, one
after every 200 steps. The expectation values of $B_g$ and $\mathrm{WAIC}_1$ were obtained by averaging
over 25 trials, that is to say, 25 sets of training samples were independently taken from
the true distribution. In Table 1, theoretical values of $E[B_g]$ for $\beta = 1$ were obtained
from [2]. Learning machines with $H = 1, 2$ do not contain the true distribution, hence
theoretical values do not exist. The two values $E[B_g]$ and $\sigma[B_g]$ are the experimental
average and standard deviation of the Bayes generalization error, respectively. The two
values $E[\mathrm{WAIC}_1]$ and $\sigma[\mathrm{WAIC}_1]$ are the experimental average and standard deviation of
$\mathrm{WAIC}_1$, respectively. The experimental results show that the average behavior of the
Bayes generalization error could be estimated by that of $\mathrm{WAIC}_1$. However, the standard
deviations of $\mathrm{WAIC}_1$ and the Bayes generalization error are not small. Note that,
even in regular statistical models, the standard deviations of the generalization error and
AIC are also not small.
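
The sampling schedule described above (discard an initial segment, then keep one parameter
every fixed number of steps) can be sketched with a generic random-walk Metropolis sampler.
This is an illustration under stated assumptions, not the code used for Table 1: the toy
one-dimensional Gaussian model, the function names, the proposal width, and the small step
counts are all hypothetical. With the paper's schedule the chain would run for
5000 + 2000 x 200 = 405000 steps.

```python
import math
import random

def make_log_posterior(data, beta, sigma=1.0, prior_scale=10.0):
    """Log of phi(w) * prod_i p(X_i | w)^beta for a toy Gaussian mean model."""
    def log_post(w):
        mu = w[0]
        log_prior = -0.5 * (mu / prior_scale) ** 2
        log_lik = sum(-0.5 * ((x - mu) / sigma) ** 2 for x in data)
        return log_prior + beta * log_lik
    return log_post

def metropolis(log_target, w0, n_burn, n_keep, thin, step, rng):
    """Random-walk Metropolis with burn-in and thinning."""
    w = list(w0)
    lp = log_target(w)
    kept = []
    for it in range(1, n_burn + n_keep * thin + 1):
        proposal = [wi + rng.gauss(0.0, step) for wi in w]
        lp_new = log_target(proposal)
        delta = lp_new - lp
        # Accept with probability min(1, exp(delta)).
        if delta >= 0.0 or rng.random() < math.exp(delta):
            w, lp = proposal, lp_new
        # After burn-in, keep one parameter every `thin` steps.
        if it > n_burn and (it - n_burn) % thin == 0:
            kept.append(list(w))
    return kept

rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(50)]
draws = metropolis(make_log_posterior(data, beta=1.0), [0.0],
                   n_burn=500, n_keep=200, thin=5, step=0.3, rng=rng)
```

The draws collected this way play the role of the posterior samples over which $E_w[\cdot]$ is
averaged when the training losses are evaluated.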



4 Singular Learning Theory 

In this section, we prove the main theorems. Proofs of the lemmas are rather
technical, hence they are given in the Appendix.

4.1 Outline of the Proof

We prove the main theorems by the following procedure.

(1) First, we show that only the neighborhoods of the true parameters essentially affect
the four errors.

(2) By using resolution of singularities, the set of parameters can be understood as the
image of an analytic map from a manifold, on which all singularities of the true parameters
are of normal crossing type.

(3) We prove that the four errors converge in law to functionals of a tight gaussian process
on the set of true parameters in the manifold.

(4) Expectations of the four errors converge to those of functionals of the tight gaussian
process.

(5) The relations between the four errors are derived by partial integration of the gaussian
process.

4.2 Basic Properties 

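By using the log density ratio function $f(x,w)$, we define the empirical Kullback-Leibler
distance by

$$K_n(w) = \frac{1}{n}\sum_{i=1}^{n} f(X_i, w).$$

For a given constant $a > 0$, we define an expectation value restricted to the set $\{w \in
W;\ K(w) \le a\}$ by

$$E_w[f(w)|_{K(w)\le a}] = \frac{\int_{K(w)\le a} f(w)\,\exp(-n\beta K_n(w))\,\varphi(w)\,dw}{\int_{K(w)\le a} \exp(-n\beta K_n(w))\,\varphi(w)\,dw}.$$

We define four errors respectively by

$$B_g(a) = E_X\big[-\log E_w[e^{-f(X,w)}|_{K(w)\le a}]\big],$$
$$B_t(a) = \frac{1}{n}\sum_{j=1}^{n} -\log E_w[e^{-f(X_j,w)}|_{K(w)\le a}],$$
$$G_g(a) = E_w[K(w)|_{K(w)\le a}],$$
$$G_t(a) = E_w[K_n(w)|_{K(w)\le a}].$$

Since $W$ is compact and $K(w)$ is an analytic function, $\bar{K} = \sup_{w \in W} K(w)$ is finite. Then
$B_g(\bar{K}) = B_g$, $B_t(\bar{K}) = B_t$, $G_g(\bar{K}) = G_g$, and $G_t(\bar{K}) = G_t$. Also we define $\eta_n(w)$ for $w$
such that $K(w) > 0$ by

$$\eta_n(w) = \frac{K(w) - K_n(w)}{\sqrt{K(w)}}, \tag{11}$$

and

$$H_t(a) = \sup_{0 < K(w) \le a} |\eta_n(w)|^2.$$

$H_t(\bar{K})$ is denoted by $H_t$.

Lemma 1 For an arbitrary $a > 0$, the following inequalities hold.

$$B_t(a) \le G_t(a) \le \tfrac{3}{2}\,G_g(a) + \tfrac{1}{2}\,H_t(a),$$
$$0 \le B_g(a) \le G_g(a),$$
$$-\tfrac{1}{4}\,H_t(a) \le G_t(a).$$

For the proof of this lemma, see Section 7. In particular, by putting $a = \bar{K}$, we have

$$B_t \le G_t \le \tfrac{3}{2}\,G_g + \tfrac{1}{2}\,H_t,$$
$$0 \le B_g \le G_g,$$
$$-\tfrac{1}{4}\,H_t \le G_t.$$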

Remark. A sequence of random variables $\{R_n\}$ is called asymptotically uniformly
integrable (AUI) if

$$\lim_{M \to \infty}\ \limsup_{n \to \infty}\ E[f_M(R_n)] = 0,$$

where

$$f_M(x) = \begin{cases} 0 & (|x| < M) \\ |x| & (|x| \ge M). \end{cases}$$

The following properties are well known.

(1) If the convergence in law $R_n \to R$ holds and $R_n$ is AUI, then $E[R_n] \to E[R]$.

(2) If $R_n$ is AUI and if a random variable $S_n$ satisfies $|S_n| \le R_n$, then $S_n$ is also AUI.

(3) If there exist $p > 0$ and $C > 0$ such that $E[|R_n|^p] < C$, then $|R_n|^q$ ($0 < q < p$) is AUI.

By Lemma 1, if $nH_t(a)$, $nG_g(a)$, and $nB_t(a)$ are AUI, then $nB_g(a)$ and $nG_t(a)$ are AUI.
Lemma 2 (1) There exists a constant $C_H > 0$ such that

$$E[(nH_t)^3] \le C_H < \infty.$$

(2) For an arbitrary $\alpha > 0$,

$$\Pr(nH_t > n^{\alpha}) \le \frac{C_H}{n^{3\alpha}}. \tag{12}$$

For the proof of this lemma, see Section 7. Lemma 2 shows that $nH_t$ is asymptotically
uniformly integrable.

Lemma 3 (1) The four errors $nB_g$, $nB_t$, $nG_g$, and $nG_t$ are all asymptotically uniformly
integrable.

(2) For an arbitrary $\epsilon > 0$, the following convergences in probability hold,

$$n(B_g - B_g(\epsilon)) \to 0,$$
$$n(B_t - B_t(\epsilon)) \to 0,$$
$$n(G_g - G_g(\epsilon)) \to 0,$$
$$n(G_t - G_t(\epsilon)) \to 0.$$

For the proof of this lemma, see Section 7. Based on this lemma, $B_g(\epsilon)$, $B_t(\epsilon)$, $G_g(\epsilon)$,
and $G_t(\epsilon)$ are referred to as the major parts of the four errors.

4.3 Resolution of Singularities 

By Lemma 3, the main region in the parameter set to be studied is

$$W_{\epsilon} = \{w \in W;\ K(w) \le \epsilon\}$$

for a sufficiently small $\epsilon > 0$. By applying Hironaka's resolution theorem to $K(w)(\epsilon -
K(w))\varphi_1(w)\pi_1(w)\cdots\pi_k(w)$, there exist a manifold $\mathcal{M} = \cup_{\alpha} U_{\alpha}$, where $U_{\alpha}$ is a local
coordinate, and a proper analytic map $g : U_{\alpha} \to W_{\epsilon}$, expressed as $w = g(u)$, such that in each
$U_{\alpha}$, the functions $K(w)$, $(\epsilon - K(w))$, $\varphi_1(w)$, $\pi_1(w), \ldots,$ and $\pi_k(w)$ are all normal crossing.
That is to say,

$$K(g(u)) = u^{2k} = \prod_{j=1}^{d} u_j^{2k_j},$$

and

$$\varphi(g(u))\,|g'(u)| = b(u)\,|u^{h}| = b(u)\,\Big|\prod_{j=1}^{d} u_j^{h_j}\Big|,$$

where $|g'(u)|$ is the Jacobian determinant, $k = (k_1, k_2, \ldots, k_d)$ and $h = (h_1, h_2, \ldots, h_d)$ are
sets of nonnegative integers, and $b(u) > 0$ is a $C^{\infty}$ class function. Note that $g(u)$, $k$, and
$h$ depend on the local coordinate $U_{\alpha}$; however, to keep notation simple, we omit the $\alpha$ that
identifies the local coordinate. By applying partitions of unity to $\mathcal{M}$, we can assume that
$g^{-1}(W_{\epsilon})$ is the union of coordinates $[0,1]^d$ and that

$$\varphi(g(u))\,|g'(u)| = u^{h}\,\psi(u),$$

where $\psi(u) > 0$ is a $C^{\infty}$ class function, without loss of generality. Existence of such a
manifold $\mathcal{M}$ and an analytic map $w = g(u)$ is well known in algebraic geometry [10],
algebraic analysis [9], and learning theory [15]. Since $W_{\epsilon}$ is compact and $g$ is a proper
map, $g^{-1}(W_{\epsilon})$ is also compact. For our purpose, we need only the compact subset $g^{-1}(W_{\epsilon})$
in $\mathcal{M}$. Therefore, hereinafter we use the notation $\mathcal{M}$ for $g^{-1}(W_{\epsilon})$, which is a compact
subset of the manifold. The set of true parameters is denoted by $W_0 = \{w \in W;\ K(w) =
0\}$ and $\mathcal{M}_0 = \{u \in \mathcal{M};\ K(g(u)) = 0\}$.
Let us define the supremum norm by

$$\|f\| = \sup_{u \in \mathcal{M}} |f(u)|.$$

Then we have a standard form of the log density ratio function.

Lemma 4 There exists an $L^s(q)$-valued analytic function $\mathcal{M} \ni u \mapsto a(x,u) \in L^s(q)$ such
that

$$f(x, g(u)) = a(x,u)\,u^{k}, \tag{13}$$
$$E_X[a(X,u)] = u^{k}, \tag{14}$$
$$K(g(u)) = 0 \implies E_X[a(X,u)^2] = 2, \tag{15}$$
$$E_X[\|a(X)\|^{6}] < \infty. \tag{16}$$

This lemma shows that, if there are only normal crossing singularities in the parameter
set, the ideal generated by the set of true parameters is trivial, with the result that the
log density ratio function is also trivial. For the proof of this lemma, see Section 7. We
define $\|a(X)\| = \sup_{u \in \mathcal{M}} |a(X,u)|$.

4.4 Empirical Processes 

An empirical process $\xi_n(u)$ is defined by

$$\xi_n(u) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} a^*(X_i, u),$$

where $a^*(x,u) = E_X[a(X,u)] - a(x,u)$. Note that $|\xi_n(u)| = \sqrt{n}\,|\eta_n(g(u))|$, where $\eta_n(w)$ in
eq.(11) is ill-defined on $K(w) = 0$ in $W$, but $\xi_n(u)$ is well-defined on $K(g(u)) = 0$ in
$\mathcal{M}$. In other words, resolution of singularities ensures that $\eta_n$ is well-defined. We have the
following lemma.

Lemma 5 The empirical process satisfies

$$E[\|\xi_n\|^{6}] < \mathrm{Const.} < \infty,$$
$$E[\|\nabla\xi_n\|^{6}] < \mathrm{Const.} < \infty,$$

where Const. does not depend on $n$, and $\|\nabla\xi_n\| = \sum_{j=1}^{d} \|\partial_j \xi_n\|$.
Let the Banach space of uniformly bounded and continuous functions on $\mathcal{M}$ be

$$B(\mathcal{M}) = \{f(u);\ \|f\| < \infty\}.$$

Since $\mathcal{M}$ is compact, $B(\mathcal{M})$ is a separable normed space. It was proved in [19] that the
empirical process $\xi_n(u)$ defined on $B(\mathcal{M})$ weakly converges to the tight gaussian process
$\xi(u)$ that satisfies

$$E_{\xi}[\xi(u)\xi(v)] = E_X[a^*(X,u)\,a^*(X,v)].$$

If $u, v \in \mathcal{M}_0$,

$$E_X[a^*(X,u)\,a^*(X,v)] = E_X[a(X,u)\,a(X,v)].$$

It is well known that a tight gaussian process is uniquely determined by its expectation
and the covariance matrix of finite points. In a singular learning machine, the Fisher
information matrix is singular; however, $E_X[a(X,u)a(X,v)]$ can be understood as a
generalized version of the Fisher information matrix.

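Let $\xi(u)$ be an arbitrary differentiable function. We define the average of $f(u)$ over $\mathcal{M}$
for the given function $\xi(u)$ by

$$E_u^{\sigma}[f(u)|\xi] = \frac{\sum_{\alpha}\int_{[0,1]^d} f(u)\,Z(u,\xi)\,du}{\sum_{\alpha}\int_{[0,1]^d} Z(u,\xi)\,du},$$

where $\sum_{\alpha}$ is the sum over all coordinates of $\mathcal{M}$, $\sigma$ is a constant which satisfies $0 \le \sigma \le 1$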
and 



Lemma 6 Assume that ki > 0. For an arbitrary analytic function ^{u), 

+a\\a{X)\\+(r\\dia{X)\\}, 



+ {a\\a{X)\\f' + {a\\dMX)\\f'}, 
where $\partial_1 = (\partial/\partial u_1)$, and $c_1, c_2, c_3 > 0$ are constants which are determined by $k_1$, $h_1$, $\beta$,
and $\|\psi\|\,\|1/\psi\|$.

Note that, by Lemma 6, $G_g(\epsilon)$ is asymptotically uniformly integrable. For the proof of
this lemma, see Section 7.

Since w = g{u), we rewrite the major parts of four errors by using the emprical process 

B,ie) = i?x[-logi?°[e-(^'")"'=|e„]], (17) 
Bt{e) = -X:-logE°[e-'^(^-")"''|a], (18) 

G,(e) = E^IU (19) 
Gtie) = E:[u''-^u'Uu)\U (20) 

In each local coordinate $[0,1]^d$, without loss of generality, we can assume that there exists
$r$ such that

$u = (x,y) \in R^r \times R^{r'}$,

where $r' = d - r$, and the multi-indices $k = (k,k')$ and $h = (h,h')$ satisfy

$\frac{h_1+1}{2k_1} = \cdots = \frac{h_r+1}{2k_r} = \lambda_\alpha < \frac{h'_j+1}{2k'_j}$ $(j = 1,\ldots,r')$,

where $(-\lambda_\alpha)$ and $r$ are respectively equal to the largest pole and its order of the
meromorphic function that is given by the analytic continuation of

$\int_{[0,1]^d} u^{2kz+h}\,du$.

We define the multi-index $\mu = (\mu_1,\ldots,\mu_{r'}) \in R^{r'}$ by

$\mu_i = h'_i - 2k'_i\lambda_\alpha$.

Then $\mu_i > -1$, hence $y^\mu$ is integrable in $[0,1]^{r'}$. Both $\lambda_\alpha$ and $r$ depend on the local
coordinate. Let $\lambda$ be the smallest $\lambda_\alpha$, and $m$ be the largest $r$ among the coordinates for
which $\lambda = \lambda_\alpha$. Then $(-\lambda)$ and $m$ are respectively equal to the largest pole and its order
of the zeta function $\zeta(z)$. Let $\alpha^*$ be the index of the set of all coordinates that satisfy
$\lambda_\alpha = \lambda$ and $r = m$. As is shown by the following lemma, only the coordinates $U_{\alpha^*}$ affect
the four errors. Let $\sum_{\alpha^*}$ denote the sum over all such coordinates.
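
The bookkeeping described above (one pole and order per local coordinate, then a minimum and a maximum over coordinates) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `chart_pole` and `global_pole` are hypothetical, each chart is represented by its pair of integer multi-indices $(k, h)$, at least one $k_i$ is assumed positive in every chart, and a coordinate with $k_i = 0$ is treated as contributing no pole.

```python
from fractions import Fraction

def chart_pole(k, h):
    """Return (lambda_alpha, r, mu) for one local coordinate.

    k, h are the integer multi-indices of u^(2k) and u^h in that coordinate.
    Coordinates with k_i == 0 are assumed to contribute no pole.
    """
    ratios = [Fraction(hi + 1, 2 * ki) if ki > 0 else None
              for ki, hi in zip(k, h)]
    finite = [q for q in ratios if q is not None]
    lam = min(finite)                           # lambda_alpha
    r = sum(1 for q in finite if q == lam)      # order of the pole
    # mu_j = h_j - 2 k_j lambda_alpha for the remaining coordinates
    mu = [Fraction(hi) - 2 * ki * lam
          for ki, hi, q in zip(k, h, ratios) if q != lam]
    return lam, r, mu

def global_pole(charts):
    """Return (lambda, m): smallest lambda_alpha and the largest r attaining it."""
    results = [chart_pole(k, h) for k, h in charts]
    lam = min(res[0] for res in results)
    m = max(res[1] for res in results if res[0] == lam)
    return lam, m

# Two hypothetical charts, each given as (k, h).
charts = [((1, 2), (1, 3)), ((1, 1), (0, 3))]
print(global_pole(charts))
```

Exact rational arithmetic is used so that ties between the ratios $(h_i+1)/(2k_i)$ are detected reliably; with floating point the order $r$ could be miscounted.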
For a given function $f(u)$, we adopt the notation $f_0(y) = f(0,y)$. For example,
$a_0(X,y) = a(X,0,y)$, $\xi_0(y) = \xi(0,y)$, and $\psi_0(y) = \psi(0,y)$. The expectation value for
a given function $\xi(u)$ is defined by

$E_{y,t}[f(y,t)|\xi] = \dfrac{\sum_{\alpha^*}\int_0^\infty dt\int dy\, f(y,t)\,Z_0(y,t,\xi)}{\sum_{\alpha^*}\int_0^\infty dt\int dy\, Z_0(y,t,\xi)}$,

where $\int dy$ denotes $\int_{[0,1]^{r'}} dy$ and
Then we have the following lemma. 

Lemma 7 Let $p \ge 0$ be a constant. There exists $c_1 > 0$ such that, for an arbitrary
$C^1$-class function $f(u)$ and analytic function $\xi(u)$, the following inequality holds,

$\left| n^p E_u^0[u^{2kp} f(u)|\xi] - E_{y,t}[t^p f_0(y)|\xi] \right| \le \frac{c_1}{\log n}\exp(4\beta\|\xi\|^2)\{\beta\|\nabla\xi\|\,\|f\| + \|\nabla f\| + \|f\|\}$,

where || V/|| = ■ 

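We define four functionals of a given function $\xi(u)$ by

$B_g^*(\xi) = \frac{1}{2}E_X[E_{y,t}[a_0(X,y)\,t^{1/2}|\xi]^2]$, (21)
$B_t^*(\xi) = G_t^*(\xi) - G_g^*(\xi) + B_g^*(\xi)$, (22)
$G_g^*(\xi) = E_{y,t}[t|\xi]$, (23)
$G_t^*(\xi) = E_{y,t}[t - t^{1/2}\xi_0(y)|\xi]$. (24)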

Note that these four functionals do not depend on n. From the definition, we can prove 
the following lemma. 



Lemma 8 For an arbitrary real measurable function $\xi(u)$,

$G_g^*(\xi) + G_t^*(\xi) = \frac{2\lambda}{\beta}$.



4.5 Proof of Theorem 1

Firstly we show that the following convergences in probability hold.

$nB_g(\epsilon) - B_g^*(\xi_n) \to 0$, (25)
$nB_t(\epsilon) - B_t^*(\xi_n) \to 0$, (26)
$nG_g(\epsilon) - G_g^*(\xi_n) \to 0$, (27)
$nG_t(\epsilon) - G_t^*(\xi_n) \to 0$. (28)

Based on eq.(19) and eq.(23), we obtain eq.(27) by Lemma 7. Also based on eq.(20) and
eq.(24), we obtain eq.(28) by Lemma 7. To prove eq.(25), we define



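$b_g(\sigma) = E_X\left[-\log E_u^0[e^{-\sigma a(X,u)u^k}|\xi_n]\right]$,

then it follows that $nB_g(\epsilon) = n\,b_g(1)$ and there exists $0 < \sigma^* < 1$ such that

$nB_g(\epsilon) = nE_u^0[u^{2k}|\xi_n] - \frac{n}{2}E_X E_u^0[a(X,u)^2u^{2k}|\xi_n]$
$\qquad + \frac{n}{2}E_X E_u^0[a(X,u)u^k|\xi_n]^2 + \frac{1}{6}n\,b_g^{(3)}(\sigma^*)$, (29)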



where we have used $E_X[a(X,u)] = u^k$. The first term on the right hand side of eq.(29) is
$nG_g(\epsilon)$. By Lemma 7 we can prove that the bound

$\left| nE_X E_u^0[a(X,u)^2u^{2k}|\xi_n] - E_X E_{y,t}[a_0(X,y)^2\,t|\xi_n] \right|$
$\qquad \le \frac{c_1}{\log n}\,e^{4\beta\|\xi_n\|^2}\,E_X[\beta\|\nabla\xi_n\|\,\|a(X)^2\| + \|\nabla a(X)^2\| + \|a(X)^2\|]$ (30)




holds. The proof of eq.(30) is as follows. Two empirical processes $\xi_n(u)$ and $\nabla\xi_n(u)$
respectively converge in law to $\xi(u)$ and $\nabla\xi(u)$ in the Banach space with the sup norm
$\|\ \|$. Therefore, their continuous functionals $\|\xi_n\|$, $\|\nabla\xi_n\|$, and $e^{4\beta\|\xi_n\|^2}$ also converge in law.
Note that $1/\log n$ goes to zero. In general, if a sequence of random variables converges
to zero in law, then it converges to zero in probability, hence we obtain the convergence
in probability eq.(30). In the following proofs, we use the same method.

Since $E_X[a_0(X,y)^2] = 2$, the sum of the first two terms of the right hand side of
eq.(29) converges to zero in probability. For the third term, by using the notation
$E_X[a(X,u)a(X,v)] = \rho(u,v)$, $\rho_0(u,y) = \rho(u,(0,y))$, and $\rho_{00}(y',y) = \rho((0,y'),(0,y))$,
and applying Lemma 7,

\nExE'MX,uym' - Ey,t[ao{X,y)t'/'m'\ 
< V^E'Ju'{V^E'M^,v)v']~Ey,[po{u,y)t'/']) 



y,t 



< 



logn 
logn 



t'/\V^E^M^,y)u'] - Ey,APoo{y\y){t'ty/']) 
Va||||p|| + ||Vp|| + ||p||) 

(/9||Ven||||p|| + ||Vp|| + 



+ 



(31) 



where '$|\xi_n$' is omitted to keep the notation simple. The right hand side of eq.(31) converges to zero
in probability by Lemma 6. Therefore the difference between the third term and $B_g^*(\xi_n)$
converges to zero in probability. For the last term, we have



irii 



b^'\cr*)\ = \Ex{Ef[a{X,ufu''m+2Ef[a{X,u)\Cn? 
-3E:* [a{X, ufu'^lQEf [a{X, u)u\^„ 
< 6nEx\HX)f Ef[u''^\^„ 
By applying Lemma 6,

\^bf\a*)\ < ^Ex[\\a{x)r{i + Unr+mnr 

+ \\a{X)r/' + \\da{X)r% (32) 

which shows that $n\,b_g^{(3)}(\sigma^*)$ converges to zero in probability. Hence eq.(25) is proved. Let
us prove eq.(26). By defining

it follows that $nB_t(\epsilon) = n\,b_t(1)$ and there exists $0 < \sigma^* < 1$ such that

1 " 

nB,{e) = nG,(e)--^E°[a(X,,«)V^|a] 

-in -1 

Then by applying Lemma 6, $n\,b_t^{(3)}(\sigma^*)$ converges to zero in probability in the same way as
for eq.(32). By the same methods as used with eq.(30) and eq.(31), replacing respectively
$E_X[\|a(X)^2\|]$ and $\rho(u,v)$ with $(1/n)\sum_j \|a(X_j)^2\|$ and $\rho_n(u,v) = (1/n)\sum_j a(X_j,u)a(X_j,v)$,
convergences in probability

1 " 

1 " 

hold, with the result that the convergence in probability

$nB_t(\epsilon) - nG_t(\epsilon) + nG_g(\epsilon) - nB_g(\epsilon) \to 0$ (33)

holds. Therefore eq.(26) is obtained. By combining eq.(25)-eq.(28) with Lemma 3 (2),
the following convergences in probability hold,

$nB_g - B_g^*(\xi_n) \to 0$, (34)
$nB_t - B_t^*(\xi_n) \to 0$, (35)
$nG_g - G_g^*(\xi_n) \to 0$, (36)
$nG_t - G_t^*(\xi_n) \to 0$. (37)

Four functionals $B_g^*(\xi)$, $B_t^*(\xi)$, $G_g^*(\xi)$, and $G_t^*(\xi)$ are continuous functions of $\xi \in B(\mathcal{M})$.
From the convergence in law of the empirical process $\xi_n \to \xi$, the convergences in law

$B_g^*(\xi_n) \to B_g^*(\xi)$, $B_t^*(\xi_n) \to B_t^*(\xi)$,
$G_g^*(\xi_n) \to G_g^*(\xi)$, $G_t^*(\xi_n) \to G_t^*(\xi)$

are derived. Therefore Theorem 1 (1) and (2) are obtained. Theorem 1 (3) is shown in
Lemma El (Q.E.D.) 

4.6 Proof of Theorem [2] 

Let $\{(x_i, g_i);\ i = 1,2,\ldots,n\}$ be a set of independent random variables which are subject
to the probability distribution

$q(x)\,\frac{1}{\sqrt{2\pi}}\,e^{-g^2/2}$.

A tight gaussian process is defined by

$\zeta_n(u) = \frac{1}{\sqrt{n}}\sum_{i=1}^n a(x_i,u)\,g_i$.

Then, in the same way as the convergence in law $\xi_n(u) \to \xi(u)$ was proved, the convergence
in law $\zeta_n(u) \to \xi(u)$ can be proved, because $\zeta_n(u)$ has the same expectation and covariance,

$E[\zeta_n(u)] = 0$, (38)
$E[\zeta_n(u)\zeta_n(v)] = E_X[a(X,u)a(X,v)]$. (39)

In other words, both $\zeta_n(u)$ and $\xi_n(u)$ converge in law to the same random process $\xi(u)$.
Moreover, we can prove that $\zeta_n(u)$ satisfies $E[\|\zeta_n\|^s] < \infty$ $(s \ge 6)$ in the same way.
Therefore we can prove equations of a gaussian random process $\xi(u)$ by using the convergence
in law $\zeta_n(u) \to \xi(u)$. Since $g_i$ is subject to the standard normal distribution,

$E[g_i F(g_i)] = E[\frac{\partial}{\partial g_i}F(g_i)]$ (40)

holds for a differentiable function $F(x)$ which satisfies $|F(x)|/|x|^k \to 0$ and $|F'(x)|/|x|^k \to 0$
$(|x| \to \infty)$ for some $k > 0$.
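
The gaussian partial integration formula of eq.(40) is easy to check numerically for a concrete test function. The sketch below is an illustration only: the helper `gaussian_expectation` and the choice $F(x) = \sin x$ are assumptions made for the example, and the expectation is approximated by composite Simpson quadrature on a truncated interval rather than computed exactly.

```python
import math

def gaussian_expectation(func, half_width=12.0, steps=24000):
    """Approximate E[func(g)] for g ~ N(0, 1) by composite Simpson's rule.

    steps must be even; the tails beyond +-half_width are neglected.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = -half_width + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * func(x) * math.exp(-0.5 * x * x)
    return total * h / 3.0 / math.sqrt(2.0 * math.pi)

# Test function F and its derivative (hypothetical choice for the example).
F = math.sin
dF = math.cos

lhs = gaussian_expectation(lambda g: g * F(g))   # E[g F(g)]
rhs = gaussian_expectation(dF)                   # E[F'(g)]
print(lhs, rhs)
```

For this choice of $F$ both sides equal $e^{-1/2}$, which gives an independent closed-form value to compare against.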

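Let us prove Theorem 2. We use the notation,

$Y(a) = \int_0^\infty dt\, t^{\lambda-1}\, e^{-\beta t + a\beta\sqrt{t}}$,

$\int du^* = \sum_{\alpha^*}\int dx\,dy\,\delta(x)\,y^\mu$,

$Z(\xi) = \int du^*\, Y(\xi(u))$,

where $u = (x,y)$. Also we define the expectation value of $f(u,t)$ for a given function $\xi(u)$,

$\langle f(u,t)\rangle_\xi = \dfrac{\int du^*\int_0^\infty dt\, f(u,t)\, t^{\lambda-1}\, e^{-\beta t + \beta\xi(u)\sqrt{t}}}{Z(\xi)}$.

Note that Lemma 8 is equivalent to

$\langle 2t\rangle_\xi - \langle\sqrt{t}\,\xi(u)\rangle_\xi = \frac{2\lambda}{\beta}$.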
By this equation and $|\sqrt{t}\,\xi(u)| \le (t + \xi(u)^2)/2$,

< I + (41) 

hold for an arbitrary function $\xi(u)$. Note that $\langle\xi(u)^2\rangle_\xi \le \|\xi\|^2$, because $\|\ \|$ is the sup
norm. The expectations of $B_g^*$, $G_g^*$, and $G_t^*$ can be written by

1 ^rJdu*a{X,u)Y'{^{u))-.2 



2E[B*] = -E[E_ 



E[G,] - j,E[ ], 

E[GA = j,E[ ]--, 

where A is a constant defined by 

•^=^1 — W) — 

We introduce An by using Cniu), 



An = E\ 



Z{Cn) 



1 " 

i=i 



Then by eq. p2!) . {Cn{u)Vt)(„ is asymptotically uniformly integrable, hence An ^ A 
{n — >• oo). On the other hand, we define 



1 A 9 



Bn = -^Y(3E[—{a{xi,u)Vtgi) 
y'^ i=i '^9i 



= E[J du {-^Y.a{x.,u)-}- 

Then by using 



d fY'{Uu))\ Y"{Uu))a{x,,u) 



we have 



Hence 



r- \2 I dv* Y'{Cn{v))a{xi,v), 

o2 n 

i=i 
Also,

$\zeta_n(u)^2 \le \frac{1}{n}\left(\sum_{k=1}^n a(x_k,u)^2\right)\left(\sum_{k=1}^n g_k^2\right)$. (43)


From eq.(41), eq.(42), and eq.(43), both $\langle a(x_i,u)\sqrt{t}\rangle_{\zeta_n}$ and $(\partial/\partial g_i)\langle a(x_i,u)\sqrt{t}\rangle_{\zeta_n}$ are
bounded by a finite sum of quadratic forms of $g_i$. Hence by eq.(40), $A_n = B_n$. Lastly, since
$\langle(1/n)\sum_{i=1}^n a(x_i,u)^2\,t\rangle_{\zeta_n}$ and $\langle(1/\sqrt{n})\sum_{i=1}^n a(x_i,u)\sqrt{t}\rangle_{\zeta_n}$ are asymptotically uniformly
integrable by eq.(41), eq.(42), we obtain $B_n \to B$, where



z(0 ^ z{0 

$= 2\beta^2 E[G_g^*] - 2\beta^2 E[B_g^*]$.

Here we have used $E_X[a(X,u)^2] = 2$ for $K(g(u)) = 0$ by Lemma 4. Since $A_n = B_n$,
$A_n \to A$, and $B_n \to B$, we have $A = B$. Therefore

$A = \beta(E[G_g^*] - E[G_t^*])$,

which completes Theorem 2. (Q.E.D.)

4.7 Proof of Theorem 3

From Lemma 8 it follows that

$E[G_g^*] + E[G_t^*] = \frac{2\lambda}{\beta}$.

Then by Theorem 1 and Lemma 3, we obtain Theorem 3. (Q.E.D.)

5 Discussion 

In this section, we discuss the theorems in this paper. 

Firstly, Theorem 1 was derived from definitions of the four errors. As is shown in the
proof,

$B_t = G_t - \hat{G}_g + \hat{B}_g + o_p(\frac{1}{n})$,

where $o_p(1/n)$ is a random variable whose order is smaller than $1/n$ and



'p{Xj\w) 



Here convergences in probability $n(G_g - \hat{G}_g) \to 0$ and $n(B_g - \hat{B}_g) \to 0$ hold. We need
the information about the true distribution to calculate both $G_g$ and $B_g$; however, we do
not need it to calculate

$V \equiv 2(\hat{G}_g - \hat{B}_g) = \frac{1}{n}\sum_{j=1}^n E_w[(\log p(X_j|w))^2] - \frac{1}{n}\sum_{j=1}^n E_w[\log p(X_j|w)]^2$.

The random variable $V$ is the variance of $\log p(X_j|w)$ with respect to the a posteriori
distribution, averaged over the training samples. By using $V$,
WAIC$_1$ and WAIC$_2$ can be replaced by

WAIC$_1$ = $BL_t + \beta V$,
WAIC$_2$ = $GL_t + \beta V$.

The third criterion WAIC$_3$

WAIC$_3$ = $BL_t - GL_t + \hat{G}_g - \hat{B}_g$

can be used as an index to examine how precisely the asymptotic theory holds. In other
words, the value |WAIC$_3$| is the error of the asymptotic theory.
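
The two criteria written with $V$ can be evaluated from posterior samples alone. The following is a minimal sketch, not the paper's procedure: it assumes that the a posteriori expectation $E_w[\ ]$ is approximated by an average over draws $w_1,\ldots,w_S$ from the posterior at inverse temperature $\beta$, that the input is a table `loglik[s][j]` $= \log p(X_j|w_s)$, and that the training losses are $BL_t = -\frac{1}{n}\sum_j \log E_w[p(X_j|w)]$ and $GL_t = -\frac{1}{n}\sum_j E_w[\log p(X_j|w)]$ (their definitions are given earlier in the paper and are assumed here). The names `waic_terms` and `log_mean_exp` are hypothetical.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def waic_terms(loglik, beta=1.0):
    """Compute BL_t, GL_t, V, WAIC_1 and WAIC_2 from loglik[s][j] = log p(X_j | w_s)."""
    draws = len(loglik)
    n = len(loglik[0])
    bl_t = gl_t = v = 0.0
    for j in range(n):
        column = [loglik[s][j] for s in range(draws)]
        mean = sum(column) / draws
        mean_sq = sum(c * c for c in column) / draws
        bl_t -= log_mean_exp(column) / n      # Bayes training loss
        gl_t -= mean / n                      # Gibbs training loss
        v += (mean_sq - mean * mean) / n      # posterior variance of log p
    return {"BLt": bl_t, "GLt": gl_t, "V": v,
            "WAIC1": bl_t + beta * v, "WAIC2": gl_t + beta * v}

# Tiny hypothetical table: 2 posterior draws, 2 data points.
print(waic_terms([[-1.0, -2.0], [-1.2, -1.7]], beta=1.0))
```

By Jensen's inequality the sketch always returns `BLt <= GLt`, and `V` vanishes when all draws assign the same likelihood to every sample.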

Secondly, let us study Theorem 2. This theorem is essentially derived from the fact
that the empirical process $\xi_n(u)$ converges to the tight gaussian process $\xi(u)$ and that the
partial integration formula

holds for $\xi(u)$.

Thirdly, Theorem 3 is proved by the property of the integral

$\int du^*\int_0^\infty dt\, t^{\lambda-1}\, e^{-\beta t + \beta\sqrt{t}\,\xi(u)}$.

That is to say, Theorems 2 and 3 are essentially proved by partial integration.

Fourthly, in this paper, we proved three results eqs.([2]), ([3]), and (j7]). The two relations 
of eq.([2]) and eq.([n]) hold universally, independently of singularities, whereas the third 
relation of eq.([7]) depends strongly on singularities. To determine the values of the four 
errors, one more relation is needed. However, it seems that there is no such relation. 
Hence in order to determine the four errors, we may have to evaluate at least one of the 
four errors. For example 

E[Gt] = -^E[-\ogZ,{mn)) 

It is conjectured that this value is determined by the generalized Fisher information matrix
$E_X[a(X,u)a(X,v)]$ on the set of true parameters $\mathcal{M}_0$. To investigate this problem in a
mathematically rigorous way is a problem for future study.

Fifthly, we assumed that the log density ratio function $f(x,w)$ is an $L^s(q)$-valued
analytic function. Even if $f(x,w)$ is not analytic, if $f(x,g(u)) = u^k a(x,u)$ holds and $a(x,u)$
satisfies the assumptions proved in the Lemmas, then the theorems hold. However, if $f(x,w)$ is
not analytic, then there are examples in which $f(x,g(u)) = u^k a(x,u)$ does not hold, and it is not
easy to judge whether $f(x,g(u)) = u^k a(x,u)$ holds or not. To study the equations
of states in this paper under weaker conditions is a problem for future study.

Lastly, let us compare the result of this paper with the asymptotic theory of regular
statistical models. In regular statistical models, the set of true parameters consists of just
one point, $W_0 = \{w_0\}$. By the transform $w = g_0(u) = w_0 + I(w_0)^{-1/2}u$, where $I(w)$ is the
Fisher information matrix,

$K(g_0(u)) \cong \frac{1}{2}|u|^2$,
$K_n(g_0(u)) \cong \frac{1}{2}|u|^2 - \frac{\xi_n\cdot u}{\sqrt{n}}$,

where $\xi_n = (\xi_n(1), \xi_n(2), \ldots, \xi_n(d))$ is defined by

$\xi_n(k) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial u_k}\log p(X_i|g_0(u))\Big|_{u=0}$.

Here each $\xi_n(k)$ converges in law to the standard normal distribution. Statistical learning
theory for regular models is based on the convergence in law $\xi_n \to \xi$, whereas that for
singular models is based on the convergence in law of the empirical process $\xi_n(u) \to \xi(u)$.

6 Conclusion 

Based on singular learning theory, we established the equations of states in learning, and 
proposed widely applicable information criteria. 
7 Appendix

7.1 Proof of Lemma 1

Since $B_g(a)$ is the Kullback-Leibler distance from $q(x)$ to $E_w[p(x|w)|_{K(w)\le a}]$, $B_g(a) \ge 0$.
Using Jensen's inequality,

we have $B_g(a) \le G_g(a)$ and $B_t(a) \le G_t(a)$. If $0 \le K(w) \le a$,

$K_n(w) = K(w) - \frac{1}{\sqrt{n}}\sqrt{K(w)}\,\eta_n(w)$.

Hence $-H_t(a)/4 \le G_t(a)$. Also we have

$K_n(w) \le \frac{3}{2}K(w) + \frac{1}{2n}\eta_n(w)^2$. (44)

Therefore $G_t(a) \le \frac{3}{2}G_g(a) + \frac{1}{2}H_t(a)$. (Q.E.D.)
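
The two bounds used in this proof follow from completing the square and from the inequality $ab \le (a^2+b^2)/2$. The derivation below is a sketch under the assumption that $K_n(w) = K(w) - \sqrt{K(w)/n}\,\eta_n(w)$ and that $H_t(a)$ dominates $\eta_n(w)^2/n$ uniformly on $K(w) \le a$; the precise definition of $H_t$ is given earlier in the paper and is not repeated here.

```latex
% Sketch, assuming K_n(w) = K(w) - \sqrt{K(w)/n}\,\eta_n(w).
\begin{align*}
K_n(w)
  &= \Bigl(\sqrt{K(w)} - \frac{\eta_n(w)}{2\sqrt{n}}\Bigr)^{2} - \frac{\eta_n(w)^{2}}{4n}
   \;\ge\; -\frac{\eta_n(w)^{2}}{4n}, \\
K_n(w)
  &\le K(w) + \sqrt{K(w)}\,\frac{|\eta_n(w)|}{\sqrt{n}}
   \;\le\; K(w) + \frac{1}{2}\Bigl(K(w) + \frac{\eta_n(w)^{2}}{n}\Bigr)
   = \frac{3}{2}K(w) + \frac{\eta_n(w)^{2}}{2n}.
\end{align*}
```

Averaging the first line over the restricted a posteriori distribution gives the lower bound on $G_t(a)$, and averaging the second gives the upper bound in terms of $G_g(a)$ and $H_t(a)$.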

7.2 Proof of Lemma [2] 

(1) For any e > and a > 0, by the definition of rin{w), 

1 1 " 



is an empirical process and f{x,w) is an analytic function of w, hence 

E[ sup I's/n'?™!^] < const. 

t<K(w)<a 

[23] [IS] [20] ■ It is proven in Lemma O that E[{nHt{e)Y] also satisfies the same inequality. 
(2) Let the random variable $S$ be defined by

$S = 1$ (if $nH_t > n^\delta$), $S = 0$ (otherwise).

Then $E[S] = \Pr(nH_t > n^\delta)$ and

$C_H = E[(nH_t)^3] \ge E[(nH_t)^3 S] \ge E[S]\,n^{3\delta}$,

which completes the Lemma. (Q.E.D.)
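
The argument in (2) is a Markov-type tail bound with a third moment. The sketch below illustrates it on an empirical sample; the squared gaussian draws are only a hypothetical nonnegative stand-in for $nH_t$, and the function names are introduced for the example. For any nonnegative sample the empirical tail fraction cannot exceed the empirical third moment divided by the cubed threshold, because the indicator of $y > c$ is at most $y^3/c^3$.

```python
import random

def tail_fraction(samples, threshold):
    """Empirical probability that a sample exceeds the threshold."""
    return sum(1 for y in samples if y > threshold) / len(samples)

def third_moment_bound(samples, threshold):
    """Empirical E[Y^3] / threshold^3, the Markov-type upper bound."""
    third_moment = sum(y ** 3 for y in samples) / len(samples)
    return third_moment / threshold ** 3

rng = random.Random(0)
# Hypothetical nonnegative stand-in for n * H_t.
samples = [rng.gauss(0.0, 1.0) ** 2 for _ in range(10000)]

for threshold in (1.0, 2.0, 4.0):
    print(threshold, tail_fraction(samples, threshold),
          third_moment_bound(samples, threshold))
```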

7.3 Proof of Lemma [3] 

We use the notation,

$S_1(f(w)) = \int_{K(w)\ge\epsilon} f(w)\,e^{-n\beta K_n(w)}\,\varphi(w)\,dw$,

$S_0(f(w)) = \int_{K(w)<\epsilon} f(w)\,e^{-n\beta K_n(w)}\,\varphi(w)\,dw$.

By using the inequahty, 



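we have inequalities for arbitrary $f(w), g(w) > 0$,

$S_1(f(w)) \le (\sup_w f(w))\,e^{-n\beta\epsilon/2}\exp(\frac{\beta}{2}nH_t)$,

$S_0(g(w)) \ge c_0\,(\inf_w g(w))\,n^{-\lambda}\exp(-\frac{\beta}{2}nH_t)$,

where $(-\lambda)$ is the largest pole of $\zeta(z)$ and $c_0 > 0$ is a constant which satisfies the inequality

$\int_{K(w)<\epsilon}\exp(-\frac{3\beta n}{2}K(w))\,\varphi(w)\,dw \ge \frac{c_0}{n^\lambda}$.

Hence

$\frac{S_1(f(w))}{S_0(g(w))} \le \frac{\sup_w f(w)}{\inf_w g(w)}\,s(n)$,

where

$s(n) = \frac{n^\lambda}{c_0}\,e^{-n\beta\epsilon/2 + n\beta H_t}$.

Then

$|\log s(n)| \le n\beta\epsilon/2 + n\beta H_t + \lambda\log n + |\log c_0|$.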
By using the function M{x) > used in eq.([T]), we define M„ by 

Then 

ml] < E[(ZmX,)lnf)] < E[(ZM{X,fln)] = Ex[M{Xf] < oo. 
(1) Firstly, we study Bayes generalization error. 

! -f(X,w)l 

n{Bg-Bg{e)) = nEx[-\og- ^ ' 



Therefore 



c ( -f(x,w)\ s" n 

c ( -f{x,w)\ 9 n 1 

n\B,-B,ie)\ < nEx[log{l + ) + ^og{l + -^^)] 

< nEx[\og{l + s{n) e^^^P- + log(l + s{n))] 

< nEx[log(l + s(n) e^^'^^^))] + ns{n). 

The second term converges to zero in probability because of Lemma 2. Let $f_1(n)$ be the
first term,

$f_1(n) = nE_X[\log(1 + s(n)\,e^{2M(X)})]$.

Let us define

$\Theta_1(x) = 1$ (if $2M(x) > n\beta\epsilon/4$), $\Theta_1(x) = 0$ (if $2M(x) \le n\beta\epsilon/4$). (45)

Then by using $\log(1+x) \le x$ and $\log(1+e^x) \le |x| + 1$,

$f_1(n) = nE_X[(1-\Theta_1(X))\log(1 + s(n)\,e^{2M(X)})]$
$\qquad + nE_X[\Theta_1(X)\log(1 + s(n)\,e^{2M(X)})]$

$\le n\,s(n)\exp(n\beta\epsilon/4)$
$\qquad + nE_X[\Theta_1(X)(2M(X) + |\log s(n)| + 1)]$,
which converges to zero in probability because, from the inequality eq.(1),

Ex[e,{X)M{X)] < {^/E[M{Xn 

Ex[e^{X)] < (-i-)^E[M(X)% 

It follows that $n(B_g - B_g(\epsilon)) \to 0$. Secondly, we prove the convergence in probability
$n(B_t - B_t(\epsilon)) \to 0$.



n 



Bt-Bt{t)\ < X^{log(l + s(n)e'^"Pl^(^-")l)+log(l + s(n))} 



< 5^1og(l + s(n)e2*^(^^)) + nlog(l + s{n)) = (46) 
i=i 

where eq.(46) is the definition of $L_n$. To prove the convergence in probability $L_n \to 0$, it
is sufficient to prove convergence in mean $E[L_n] \to 0$. Let the random variable $\Theta_2$ be

$\Theta_2 = 1$ (if $nH_t > n\beta\epsilon/4$), $\Theta_2 = 0$ (if $nH_t \le n\beta\epsilon/4$). (47)

Then

E[L^] = E[L^{l-e2)]+E[L^Q2] 

< nEx[log(l + (nVco)e2^-^(^)-"'^^/^)] 
+n^+^exp(-ra/5e/4)/co 
+E[02n(2M„ + |logs(n)| + 1)] 
+E[e2n{\\ogs{n) \ + 1)] 

That the first term goes to zero can be proved in the same way as $f_1(n) \to 0$. The second term
goes to zero as a real sequence. Both the third and fourth terms go to zero because
E[e2nM^] < nPr{nHt > ny^^E[M^]^/^, 
E[ne2inPe)] = n^(3e Pr{nHt > n/?e/4), 
E[nQ2inHt)] < nPr{nHt > n(3e/AY''^E[{nHtfY''^, 

and by using Lemma 2. Thus we obtain $n(B_t - B_t(\epsilon)) \to 0$. Thirdly, the Gibbs
generalization error can be estimated as

$n|G_g - G_g(\epsilon)| \le n\left|\frac{S_0(K(w)) + S_1(K(w))}{S_0(1) + S_1(1)} - \frac{S_0(K(w))}{S_0(1)}\right|$

$\qquad \le \frac{nS_1(K(w))}{S_0(1)} + \frac{nS_0(K(w))\,S_1(1)}{S_0(1)^2}$

$\qquad \le 2n\,\overline{K}\,s(n)$, (48)
which converges to zero in probability. Lastly, in the same way, the Gibbs training error
satisfies

$n|G_t - G_t(\epsilon)| \le 2n\,s(n)\,\sup_w|K_n(w)|$
$\qquad \le 2n\,s(n)\,M_n$,

which converges to zero in probability.

(2) Firstly, from Lemma 2, $nH_t$ is AUI. Secondly, let us prove $nB_t$ is AUI. Let $L_n$ be the
term in eq.(46). Then

$|nB_t| \le |nB_t(\epsilon)| + L_n$.
Moreover, by employing a function, 

1 

bis) = —Y.\ogE^[e~'f^''^'% 
there exists < s* < 1 such that 

- nb{l) - • 

Hence 

n 

\nBt\ <J2snp\fiX„w)\ < nM^ 

Therefore

$|nB_t| \le |nB_t(\epsilon)| + B^*$,

where

$B^* = nM_n$ (if $nH_t > \epsilon\beta n/4$), $B^* = L_n$ (if $nH_t \le \epsilon\beta n/4$).

By summing the above equations,

$E[|nB_t|^{3/2}] \le E[2|nB_t(\epsilon)|^{3/2}] + E[2(B^*)^{3/2}]$.

In Lemma O we prove that i?[|ni?f(e)|^/^] < oo. By Lemma [2] (2) with 5 such that 
= ef3n/A, we have P{Ht > e/3/4) < C'^j/n^, hence 

E[{B*f^^] < E[Q2iB*Y/^] + E[{1 - Q2){B*f^^] 

+E[(i-e2)(L„p] <oo. 

The first term is finite because $E[\Theta_2] = \Pr(nH_t > n\beta\epsilon/4)$. Finiteness of the second term
can be proved in the same way as proving that $E[(1-\Theta_2)L_n] \to 0$. Hence $|nB_t|$ is AUI.
Lastly, we show that $nG_g$ is AUI. From eq.(48),

$0 \le nG_g \le nG_g(\epsilon) + 2n\,s(n)\,\overline{K}$.



Moreover, always $nG_g \le n\overline{K}$, by definition. Therefore

$nG_g \le nG_g(\epsilon) + K^*$,

where

~ \Kns{n) {nHt < n'^^^) 

f nK {nHt > n^^^) 
- [K e-"^^/3 (^^^ < ^2/3^ • 

Then

$0 \le E[(nG_g)^{3/2}] \le E[2(nG_g(\epsilon))^{3/2}] + E[2(K^*)^{3/2}]$.

It is proven in Lemma E] tliat E[{nGg{e)Y/'^] < oo. By Lemma [2] with 6 = 2/3, we have 
P{nHt > n^/^) < Gn/n^, hence 

E[{K*f^] < ^3/2^'/'^ + :^e-"^^/2 < ^_ 

Hence nGg is AUI. Since E[{nHtf] < oo, E[{nBtf/^] < oo, and E[{nGgf/^] < oo all four 
errors are also AUI by Lemma [H (Q.E.D.) 

7.4 Proof of Lemma |4] 

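By the definition of the Kullback-Leibler distance and $f(x,g(u)) = \log(q(x)/p(x|g(u)))$,
for arbitrary $u \in \mathcal{M}$,

$K(g(u)) = \int f(x,g(u))\,q(x)\,dx$
$\qquad = \int (e^{-f(x,g(u))} + f(x,g(u)) - 1)\,q(x)\,dx$
$\qquad = \int \frac{f(x,g(u))^2}{2}\,e^{-t^* f(x,g(u))}\,q(x)\,dx$,

where $0 < t^* < 1$. Let $U'$ be a neighborhood of $u = 0$. For arbitrary $L > 0$ the set $D_L$ is
defined by

$D_L = \{x \in R^N;\ \sup_{u\in U'}|f(x,g(u))| \le L\}$.

Then for any $u \in U'$,

$u^{2k} \ge \frac{e^{-L}}{2}\int_{D_L} f(x,g(u))^2\,q(x)\,dx$,

with the result that, for any $u^{2k} \ne 0$ $(u \in U')$,

$1 \ge \frac{e^{-L}}{2}\int_{D_L}\frac{f(x,g(u))^2}{u^{2k}}\,q(x)\,dx$. (49)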
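
The identity used at the start of this proof, namely that the Kullback-Leibler distance equals the integral of $e^{-f} + f - 1$ against $q$, holds because $\int e^{-f(x,w)}q(x)\,dx = \int p(x|w)\,dx = 1$. A minimal numerical sketch on a finite sample space is given below; the two probability vectors are hypothetical and the function names are introduced only for this example.

```python
import math

def kl_direct(q, p):
    """Kullback-Leibler distance sum_x q(x) log(q(x)/p(x))."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def kl_via_identity(q, p):
    """Same quantity written as sum_x q(x) (exp(-f) + f - 1), f = log(q/p)."""
    total = 0.0
    for qi, pi in zip(q, p):
        f = math.log(qi / pi)
        total += qi * (math.exp(-f) + f - 1.0)   # every summand is nonnegative
    return total

q = [0.5, 0.3, 0.2]   # hypothetical true distribution
p = [0.2, 0.5, 0.3]   # hypothetical model distribution
print(kl_direct(q, p), kl_via_identity(q, p))
```

The second form makes the nonnegativity of each summand explicit, which is what allows the integral to be restricted to the set $D_L$ in the inequalities above.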

Since $f(x,g(u))$ is an $L^s(q)$-valued real analytic function, it is given by an absolutely
convergent power series,

$f(x,g(u)) = \sum_\alpha a_\alpha(x)u^\alpha$
$\qquad = a(x,u)u^k + b(x,u)u^k$,

where

$a(x,u) = \sum_{\alpha\ge k} a_\alpha(x)u^{\alpha-k}$,
$b(x,u) = \sum_{\alpha<k} a_\alpha(x)u^{\alpha-k}$,

and $\sum_{\alpha\ge k}$ denotes the sum over indices that satisfy

$\alpha_i \ge k_i$ $(i = 1,2,\ldots,d)$ (50)

and $\sum_{\alpha<k}$ denotes the sum over indices that do not satisfy eq.(50). Here $a(x,u)$ is an
$L^s(q)$-valued real analytic function. From eq.(49), for an arbitrary $u^{2k} \ne 0$ $(u \in U')$,

1 > / {a{x,u) + b{x,u)yq{x)dx 

Jdl 

> / b{x,uf'q{x)dx — / a{x,uyq{x)dx. 

2 Jdl Jdl 

Here $|a(x,u)|$ is a bounded function of $u \in U'$. If $b(x,u) \equiv 0$ does not hold, then
$|b(x,u)| \to \infty$ $(u \to 0)$, hence we can choose $u$ and $D_L$ so that the above inequality does
not hold. Therefore, we have $b(x,u) \equiv 0$, which shows eq.(13). From

$u^{2k} = \int f(x,g(u))\,q(x)\,dx = \int a(x,u)\,u^k q(x)\,dx$,

we obtain eq.(14). To prove eq.(15), it is sufficient to prove $E_X[a(X,u)^2] = 2$ when
$K(g(u)) = 0$. Let the Taylor expansion of $f(x,g(u))$ be

f{x,g{u)) = J^o-Mu". 

a 

Then 

\a.(T)\ < - 



/ M M(x) , , 

a„(x)|<-^ (51) 



where R is the associated convergence radii and 



a(x, m) = ^ aaix^v!^ 



a>k 

Hence 



Hx,u)\ < ^-^r 



a>k 

M{x 



M{x)^a-k 



cr 



Ri^ ' 

where Ci > is a constant. For arbitrary u {u'^ ^ 0), 



1= f <^^^)\ -eaMu\ 



2 _ 'q{x)dx, 
where $0 < t^* < 1$. Put



Then 



Six,u) < ci^^max{l,e-'^(-'")"'}g(x) 

= ^1 j^2k max{g(x),p(x|w)| 

M(x)2 
^ Ci-^Q(x). 

By the fundamental condition (A.3), $M(x)^2Q(x)$ is an integrable function, hence $S(x,u)$
is bounded by the integrable function. By using Lebesgue's convergence theorem, as
$u^{2k} \to 0$, we obtain

for any u that satisfies u'^'' = 0, which proves eq. ffTSl) . Lastly, since f{x,u) is an L'^{q) 
valued analytic function, a{x,u) is also an L^{q) valued analytic function. Moreover, 
eq.(|5T|) shows eq.^. (Q.E.D.) 

7.5 Proof of Lemma [5] 

The proof is given in [19] and Theorem 39 in [20].

7.6 Proof of Lemma [6] 

Let $u = (u_1, u_2, \ldots, u_d)$. Since at least one of the non-negative integers $k_1,\ldots,k_d$ is not equal
to zero, we can assume $k_1 \ge 1$ without loss of generality. Put $g(u) = u_2^{k_2}\cdots u_d^{k_d}$ and
$h(u) = u_2^{h_2}\cdots u_d^{h_d}$. Then $u^k = u_1^{k_1}g(u)$, $u^h = u_1^{h_1}h(u)$, where neither $g(u)$ nor $h(u)$
depends on $u_1$. We adopt the notation,



'[0, 

f(u) = (3V^u''^{u) + au''a{X,u), 



By the definition and Ci 



iVo 



By applying partial integration to N2, 

^ J [0,1]^ 2 pnki ^ ^ 
h{u) 



^J[o,ir Wnk 



E / e-^""^'^'+/(") {h, + l + uM{u)) du. 
From the definition of $f(u)$,

$u_1\partial_1 f(u) = \beta\sqrt{n}\,(k_1u^k\xi(u) + u^k\,u_1\partial_1\xi(u))$

$\qquad + \sigma k_1u^k a(X,u) + \sigma u^k\,u_1\partial_1 a(X,u)$.

By using inequalities 



and \u'^\ < 1, 



\uidj{u)\ < (^{h{nu'' + liein + nu"' + \\d,a'} + kia\\a\\ + a\\d,a\\. 
Hence 

witfi tfie result that 

N2 1 r, , 9 ,, r^ ..,,9 + (T||()|alh 

'Wo ^ 2* + ' + + ll^'^ll + % }- 

where 2;i = {3ki — l)/(4A;i), which shows the first half of the lemma. Let us prove the 

latter half. Firstly, 

In the same way as for the first half, by applying partial integration, we have 

^3 < E / e-^""''+^(«) ^h, + h + l + u,dj{u)) du. 

^ J[o,i]d 2(3nki 

Therefore, we obtain 



No ~ 2(3kxn^No 2 No 



x{k, + h, + l + ^IICIP + + A;ic7||a|| + a\\d,a\\)]. 

Therefore 

By using the Cauchy-Schwarz inequality, that is to say, $N_1/N_0 \le (N_2/N_0)^{1/2}$, and by
applying the result of the first half and Hölder's inequality,



No - 2(3kin^No' L ^ ' ^ ' '2 2 
^ -Sj{l + lieir + l|5ieir + a||a|H-a||5ia||}'^ 
where $C > 0$ is a constant which is determined by $k_1$, $h_1$, and $\beta$. In general,

^ k=i ^ k=i 

which completes the proof. (Q.E.D.) 

7.7 Proof of Lemma [7] 

For given functions ^{u) and g{u), we define 

Then 

It is rewritten as 

g) = y r dt I dx dy 6(t - nx^'^y'"'') x^y^' g(x, y) — e-f^^-^^^i^-^v) , 
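To analyze this function, we need the fact that, for $\mathrm{Re}(z) > 0$,

$\int_{[0,1]^r}(a\,x^{2k})^z\,x^h\,dx = a^z\prod_{i=1}^r\int_0^1 x_i^{2k_iz+h_i}\,dx_i = \frac{a^z}{2^r k_1\cdots k_r\,(z+\lambda_\alpha)^r}$.

By applying the inverse Mellin transform to this equation, we have

$\int_{[0,1]^r}\delta(t - a\,x^{2k})\,x^h\,dx = c_0\,\frac{t^{\lambda_\alpha-1}}{a^{\lambda_\alpha}}\left(\log\frac{a}{t}\right)^{r-1}$ $(0 < t < a)$,
and the integral is equal to $0$ otherwise.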
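
The density obtained from the inverse Mellin transform can be checked numerically in the case $r = 2$ by comparing cumulative quantities: the $x^h$-weighted volume of $\{x \in [0,1]^2;\ a\,x^{2k} \le t\}$ should equal the integral of the density from $0$ to $t$, which for $r = 2$ is $c_0\,(t/a)^{\lambda}\,(\log(a/t)/\lambda + 1/\lambda^2)$ with $c_0 = 1/(2^2 k_1 k_2)$. The sketch below assumes $0 < t < a$ and that both coordinates share the same ratio $(h_i+1)/(2k_i) = \lambda$; the function names and the numerical values are hypothetical.

```python
import math

def weighted_volume(t, a, k, h, steps=20000):
    """Integral of x1^h1 * x2^h2 over {a * x1^(2 k1) * x2^(2 k2) <= t} in [0,1]^2.

    The x2-integral is done in closed form, the x1-integral by the midpoint rule.
    """
    (k1, k2), (h1, h2) = k, h
    dx = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x1 = (i + 0.5) * dx
        # Largest admissible x2 for this x1, capped at 1.
        bound = min(1.0, (t / (a * x1 ** (2 * k1))) ** (1.0 / (2 * k2)))
        total += x1 ** h1 * bound ** (h2 + 1) / (h2 + 1)
    return total * dx

def predicted_volume(t, a, k, h):
    """Integral over (0, t) of c0 * tau^(lam-1) * a^(-lam) * log(a/tau), case r = 2."""
    (k1, k2), (h1, h2) = k, h
    lam = (h1 + 1) / (2 * k1)
    assert abs(lam - (h2 + 1) / (2 * k2)) < 1e-12, "both ratios must equal lambda"
    c0 = 1.0 / (2 ** 2 * math.factorial(1) * k1 * k2)
    return c0 * (t / a) ** lam * (math.log(a / t) / lam + 1.0 / lam ** 2)

val = weighted_volume(0.3, 2.0, (1, 2), (1, 3))
pred = predicted_volume(0.3, 2.0, (1, 2), (1, 3))
print(val, pred)
```

Comparing cumulative volumes rather than the delta-function integral itself avoids having to discretize a distribution.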



An^o,go)=T. dt rfyco^^e-^-/^v/^«o(^)(log^)-^,o(l/). (53) 



where $c_0 = 1/(2^r (r-1)!\,k_1\cdots k_r)$. If $g_0(y) = g(0,y)$ and $\xi_0(y) = \xi(0,y)$ then

iy Co- 

iKny^'^ <n 

where the region '$t < ny^{2k'} < n$' denotes the set $\{y \in [0,1]^{r'};\ t < ny^{2k'} < n\}$. Then by
using eq.(53),

< e,||j||e-««'"/=<!H5^, (54) 
where Ci > is a constant. In the same way, 

\m^g)\ > c;min|,| e-^^l'^l'V^ (l^i^I^. (55) 
Let $\lambda$ be the smallest value in $\{\lambda_\alpha;\ \alpha\}$. Then $(-\lambda)$ is equal to the largest pole of $\zeta(z)$.
The coordinate $U_\alpha$ whose $\lambda_\alpha$ is equal to the smallest one, $\lambda_\alpha = \lambda$, and whose $r$ is equal
to the largest one, $r = m$, is denoted by $U_{\alpha^*}$. The sum $\sum_{\alpha^*}$ denotes the sum restricted
to such coordinates. Let A^{^,g) be the sum of AP{^,g) restricted in this way, in other 
words, J2a is replaced by J2a* in eq.(^. Also we define = AP{^,g) - Ap(^o,5'o)- 

There exists x* G [0, 1]^' such that 

Hence 

m^,9)\ < C2(||V^7|| +/3||^||||V^||) (56) 
By expanding eq. (l53|) . we have 

m 

k=l 



A:\^o,9o) = E^^^^^f?"!) I dtl dy 



2k' 

xco (log^r-'goiy) t^^^-' e-*+v^«o(^). 

The largest order term among them is ^^"^(^o; fl'o)- We define -B^"^(^ojfl'o from A^"^ by 
replacing the integral region of y, 

(logn)"""^ ^oo 



The difference between ^^™'(^0)5'o) and -Bf'"(^0)fl'o) is smaller than ||5f||e^ll^'l^/^/nP+'*', and 
\AlMo.9.)\ < C3|kl|e-^'l«ll^/^ (l<A:<m), (57) 



Br{io,9o)\ > C3.||^||e-^^llgll^/^ ■ (58 



(logn 



.r-l 



71' 



By the definition. 



D^E'^[u'^'f{um-Ey,[t'f{0,ym 



AO(e,^) i?°™(eo,^o) ■ 

Then using eqs. (|M|) - (|35]) . 

R^{^,g) ^ A%^,g)-Br{^o,9o) 

m 

= Alii, g) + g) + Y. Al\io, 9o) - Briio, 9o), 

k=l 

where AP{i,g) = A'^{i,g) — A'^{C,,g) is the sum over a that are not a*. Therefore 

^ i\\9\\+m\m\\ + iiv^7ii) 
Thus



< il^illlyll^TT^ e^^ll^ll^ (Ikll +/?ll^lll|Ve|| + IIV^^II) 
ip nP log n 

which completes the Lemma. (Q.E.D.) 
7.8 Proof of Lemma [8] 

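By using partial integration, for an arbitrary $a \in R$,

$\int_0^\infty e^{-\beta t}\,2t^\lambda e^{a\beta\sqrt{t}}\,dt = \frac{1}{\beta}\int_0^\infty e^{-\beta t}\frac{d}{dt}\left(2t^\lambda e^{a\beta\sqrt{t}}\right)dt$.

Hence

$\int_0^\infty dt\,\left(2t - \sqrt{t}\,a - \frac{2\lambda}{\beta}\right)t^{\lambda-1}e^{-\beta t + a\beta\sqrt{t}} = 0$, (59)

which shows Lemma 8. (Q.E.D.)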
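
The identity of eq.(59) can be confirmed numerically. The sketch below is an illustration under stated assumptions: the integral is rewritten with the substitution $t = s^2$ (so that $t^{\lambda-1}dt = 2s^{2\lambda-1}ds$), truncated at a finite upper limit, and evaluated by composite Simpson quadrature; it assumes $\lambda \ge 1/2$ so that the transformed integrand is bounded at $s = 0$. The function name `lemma8_integral` and the parameter values are hypothetical.

```python
import math

def lemma8_integral(lam, beta, a, upper=12.0, steps=24000):
    """Simpson approximation of
    int_0^inf (2t - a sqrt(t) - 2 lam / beta) t^(lam-1) exp(-beta t + a beta sqrt(t)) dt
    after substituting t = s^2. Assumes lam >= 1/2 and an even number of steps.
    """
    def integrand(s):
        polynomial = 2.0 * s * s - a * s - 2.0 * lam / beta
        return polynomial * 2.0 * s ** (2.0 * lam - 1.0) * math.exp(-beta * s * s + a * beta * s)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0

for lam, beta, a in [(1.5, 1.0, 0.7), (1.0, 2.0, -0.5), (0.5, 1.0, 0.3)]:
    print(lam, beta, a, lemma8_integral(lam, beta, a))
```

Each printed value should be zero up to quadrature and rounding error, for every real $a$, which is the content of the lemma once the integral is averaged over $u$.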